Intermediate Workshop of the PRIN 2022 project, 5-6 February 2026, Bologna
outline
learning from dissimilarities
some unsupervised learning methods take as input a dissimilarity matrix
intuition
2 continuous variables: add up by-variable (absolute value or squared) differences
intuition
2 continuous variables and 1 categorical variable
intuition
one might consider purple and blue closer than e.g. purple and yellow
independence-based
Most commonly used dissimilarity (or distance) measures are based on by-variable differences that are then added together
independence-based
When variables are correlated or associated, shared information is effectively counted multiple times
inflated dissimilarities may distort downstream unsupervised learning tasks.
independence-based
The Euclidean distance \(\longrightarrow\) shared information is over-counted
association-based
The Mahalanobis distance \(\longrightarrow\) shared information is not over-counted
association-based pairwise distance
Association-based for continuous: Mahalanobis distance
Let \({\bf X}_{con}\) be an \(n\times Q_{d}\) data matrix of \(n\) observations described by \(Q_{d}\) continuous variables, and let \(\bf S\) be the sample covariance matrix; the Mahalanobis distance matrix is
\[ {\bf D}_{mah} = \left[\operatorname{diag}({\bf G})\,{\bf 1}_{n}^{\sf T} + {\bf 1}_{n}\,\operatorname{diag}({\bf G})^{\sf T} - 2{\bf G}\right]^{\odot 1/2} \] where
\([\cdot]^{\odot 1/2}\) denotes the element-wise square root
\({\bf G}=({\bf C}{\bf X}_{con}){\bf S}^{-1}({\bf C}{\bf X}_{con})^{\sf T}\) is the Mahalanobis Gram matrix
\({\bf C}={\bf I}_{n}-\tfrac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{\sf T}\) is the centering operator
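A minimal R sketch of this construction (data and object names are illustrative, not taken from the manydist package):

```r
## Mahalanobis distance matrix via the Gram matrix (illustrative sketch)
set.seed(1)
n <- 10; Q_d <- 3
X_con <- matrix(rnorm(n * Q_d), n, Q_d)               # continuous data matrix
C <- diag(n) - matrix(1 / n, n, n)                    # centering operator
S <- cov(X_con)                                       # sample covariance matrix
G <- (C %*% X_con) %*% solve(S) %*% t(C %*% X_con)    # Mahalanobis Gram matrix
g <- diag(G)
D_mah <- sqrt(pmax(outer(g, rep(1, n)) + outer(rep(1, n), g) - 2 * G, 0))  # element-wise sqrt

## check one entry against a direct computation
sqrt(mahalanobis(X_con[1, , drop = FALSE], center = X_con[2, ], cov = S))  # equals D_mah[1, 2]
```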
association-based pairwise distance
Association-based for categorical: total variation distance (TVD)1
The distance matrix \({\bf D}_{tvd}\) is defined using the so-called delta framework2, a general way to define categorical data distances.
Let \({\bf X}_{cat}\) be an \(n\times Q_{c}\) data matrix of \(n\) observations described by \(Q_{c}\) categorical variables.
\[ {\bf D} = {\bf Z}{\Delta}{\bf Z}^{\sf T} = \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right] \]
where \({\bf Z}=[{\bf Z}_1,\ldots,{\bf Z}_{Q_c}]\) is the \(n\times Q^{*}\) super-indicator matrix, with \(Q^{*}=\sum_{j=1}^{Q_c} q_j\) the total number of categories
\({\Delta}_j\) is the category dissimilarity matrix for variable \(j\), i.e., the \(j\)th diagonal block of the block-diagonal matrix \({\Delta}\).
setting \({\Delta}_j\) determines the categorical distance measure of choice (independence- or association-based)
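For concreteness, a small R sketch of the framework with the simplest independence-based choice, \(\Delta_j={\bf 1}{\bf 1}^{\sf T}-{\bf I}\) (match/mismatch), so that \({\bf D}\) counts the number of mismatching variables per pair (illustrative, not the manydist implementation):

```r
## Delta framework: D = Z %*% blockdiag(Delta_1, ..., Delta_Qc) %*% t(Z)
## with the match/mismatch choice Delta_j = J - I (illustrative sketch)
set.seed(1)
n <- 8
X_cat <- data.frame(v1 = factor(sample(c("a", "b", "c"), n, TRUE)),
                    v2 = factor(sample(c("L", "H"), n, TRUE)))
Zs <- lapply(X_cat, function(x) model.matrix(~ x - 1))  # indicator matrices Z_j
q  <- sapply(Zs, ncol)
Z  <- do.call(cbind, Zs)                                # super-indicator matrix (n x Q*)

## block-diagonal Delta with Delta_j = J - I (1 for a mismatch, 0 for a match)
Qstar <- sum(q); Delta <- matrix(0, Qstar, Qstar); off <- c(0, cumsum(q))
for (j in seq_along(q)) {
  idx <- (off[j] + 1):off[j + 1]
  Delta[idx, idx] <- 1 - diag(q[j])
}
D <- Z %*% Delta %*% t(Z)   # entry (i, i') = number of variables on which i and i' differ
```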
association-based pairwise distance
Association-based for categorical: total variation distance (TVD)1 (2)
Consider the empirical joint probability distributions stored in the off-diagonal blocks of \({\bf P}\):
\[ {\bf P} = \frac{1}{n} \begin{bmatrix} {\bf Z}_1^{\sf T}{\bf Z}_1 & {\bf Z}_1^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_1^{\sf T}{\bf Z}_{Q_c} \\ \vdots & \ddots & \vdots & \vdots \\ {\bf Z}_{Q_c}^{\sf T}{\bf Z}_1 & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_{Q_c} \end{bmatrix}. \]
We refer to the conditional probability distributions of each variable \(i\) given each variable \(j\) (\(i,j=1,\ldots,Q_c\), \(i\neq j\)), stored in the block matrix
\[ {\bf R} = {\bf P}_z^{-1}({\bf P} - {\bf P}_z). \]
where \({\bf P}_z = {\bf P} \odot {\bf I}_{Q^*}\), and \({\bf I}_{Q^*}\) is the \(Q^*\times Q^*\) identity matrix.
association-based pairwise distance
Association-based for categorical: total variation distance (TVD)1 (3)
Let \({\bf r}^{ji}_a\) and \({\bf r}^{ji}_b\) be the rows of \({\bf R}_{ji}\), the \((j,i)\)th off-diagonal block of \({\bf R}\). The category dissimilarity between \(a\) and \(b\) for variable \(j\) based on the total variation distance (TVD) is defined as
\[ \delta^{j}_{tvd}(a,b) = \sum_{i\neq j}^{Q_c} w_{ji} \Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) = \sum_{i\neq j}^{Q_c} w_{ji} \left[\frac{1}{2}\sum_{\ell=1}^{q_i} |{\bf r}^{ji}_{a\ell}-{\bf r}^{ji}_{b\ell}|\right], \]
where \(w_{ji}=1/(Q_c-1)\) for equal weighting (can be user-defined).
The TVD-based dissimilarity matrix is therefore
\[ {\bf D}_{tvd}= {\bf Z}{\Delta}^{(tvd)}{\bf Z}^{\sf T}. \]
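A minimal R sketch of the TVD-based \(\Delta\) and \({\bf D}_{tvd}\) with equal weights \(w_{ji}=1/(Q_c-1)\) (illustrative, not the manydist implementation):

```r
## TVD category dissimilarities and D_tvd (illustrative sketch, equal weights)
set.seed(1)
n <- 50
X_cat <- data.frame(v1 = factor(sample(c("a", "b", "c"), n, TRUE)),
                    v2 = factor(sample(c("L", "H"), n, TRUE)),
                    v3 = factor(sample(c("x", "y", "z"), n, TRUE)))
Q_c <- ncol(X_cat)
Zs  <- lapply(X_cat, function(x) model.matrix(~ x - 1))   # indicator matrices Z_j
q   <- sapply(Zs, ncol)

Delta_tvd <- vector("list", Q_c)
for (j in seq_len(Q_c)) {
  Dj <- matrix(0, q[j], q[j])
  for (i in setdiff(seq_len(Q_c), j)) {
    ## rows of R_ji: distribution of variable i conditional on the categories of variable j
    R_ji <- prop.table(crossprod(Zs[[j]], Zs[[i]]), margin = 1)
    for (a in seq_len(q[j])) for (b in seq_len(q[j]))
      Dj[a, b] <- Dj[a, b] + (1 / (Q_c - 1)) * 0.5 * sum(abs(R_ji[a, ] - R_ji[b, ]))
  }
  Delta_tvd[[j]] <- Dj
}

## assemble the block-diagonal Delta and compute D_tvd = Z Delta Z^T
Z <- do.call(cbind, Zs)
Qstar <- sum(q); Delta <- matrix(0, Qstar, Qstar); off <- c(0, cumsum(q))
for (j in seq_len(Q_c))
  Delta[(off[j] + 1):off[j + 1], (off[j] + 1):off[j + 1]] <- Delta_tvd[[j]]
D_tvd <- Z %*% Delta %*% t(Z)
```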
association-based for mixed
A straightforward association-based (AB) distance for mixed data is given by the convex combination of the Mahalanobis and TVD distances:
\[ {\bf D}_{mix} =\frac{Q_{d}}{Q}\,{\bf D}_{mah} +\left(1-\frac{Q_{d}}{Q}\right){\bf D}_{tvd}. \]
this distance only accounts for correlations or associations among variables of the same type
no continuous–categorical interactions are considered.
how to measure interactions
define \(\Delta^{int}\), which accounts for the interactions, and use it to augment \(\Delta^{tvd}\)
\[ {\bf D}_{mix}^{(int)} = {\bf D}_{mah} + {\bf D}_{cat}^{(int)}. \]
where
\[ {\bf D}_{cat}^{(int)}={\bf Z}\tilde{\Delta}{\bf Z}^{\sf T} \] and
\[ \tilde{\Delta} = (1-\alpha)\Delta^{tvd} + \alpha \Delta^{int} \] where \(\alpha=\frac{1}{Q_{c}}\).
how to measure interactions
What is \(\Delta^{int}\)?
\(\Delta^{int}_{j}\)
consider \({\bf D}_{mah}\) and sort it to identify the neighbors for each observation.
set a proportion of neighbors to consider, say \(\hat{\pi}_{nn}=0.1\)
for each pair of categories \((a,b)\), \(a,b=1,\ldots,q_{j}\), \(a\neq b\) of the \(j^{th}\) categorical variable:
classify the observations using the prior-corrected1 decision rule
\[ \text{if $i$ is such that }\ \ \ \frac{\hat{\pi}_{nn}(a)}{\hat{\pi}(a)}\geq\frac{\hat{\pi}_{nn}(b)}{\hat{\pi}(b)} \ \ \ \text{ then assign $i$ to class $a$ else to class $b$} \]
compute the balanced accuracy2 (average of class-wise sensitivities) \[ \delta_{int}^{j}(a,b)=\frac{1}{2}\left(\frac{\texttt{true } a}{\texttt{true } a + \texttt{false }a}+\frac{\texttt{true } b}{\texttt{true } b + \texttt{false }b}\right) \]
\(\delta_{int}^{j}(a,b)\) close to 1 indicates that categories \(a\) and \(b\) are well separated in the continuous space; values close to 0.5 indicate they are not
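A compact R sketch of this computation for a single pair \((a,b)\) (the data, \(\hat{\pi}_{nn}\), and details such as which observations enter the neighbour comparison are assumptions, not the authors' exact implementation):

```r
## delta_int(a, b) for one categorical variable j via NN-based classification
## (illustrative sketch; data and implementation details are assumptions)
set.seed(1)
n      <- 200
D_mah  <- as.matrix(dist(matrix(rnorm(n * 3), n, 3)))   # stand-in for the Mahalanobis distances
x_j    <- factor(sample(c("a", "b", "c"), n, TRUE))     # the j-th categorical variable
pi_nn  <- 0.10
k      <- ceiling(pi_nn * n)                            # number of neighbours per observation

delta_int <- function(a, b) {
  idx   <- which(x_j %in% c(a, b))                      # observations in categories a or b
  prior <- prop.table(table(droplevels(x_j[idx])))      # priors pi(a), pi(b)
  pred  <- sapply(idx, function(i) {
    nb  <- order(D_mah[i, ])[2:(k + 1)]                 # k nearest neighbours (excluding i)
    pnn <- prop.table(table(factor(x_j[nb], levels = c(a, b))))  # pi_nn(a), pi_nn(b)
    if (pnn[a] / prior[a] >= pnn[b] / prior[b]) a else b          # prior-corrected rule
  })
  truth  <- as.character(x_j[idx])
  sens_a <- mean(pred[truth == a] == a)                 # class-wise sensitivities
  sens_b <- mean(pred[truth == b] == b)
  (sens_a + sens_b) / 2                                 # balanced accuracy
}
delta_int("a", "b")   # close to 0.5 here, since the continuous space carries no signal about x_j
```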
Building \(\Delta^{int}_{j}\)
for the general categorical variable \(j\) with \(q_{j}\) categories, compute the quantity for each of the \(\frac{q_j(q_j -1)}{2}\) pairs; e.g., for \(q_{j}=4\):
\[ \Delta^{int}_{j} = \begin{pmatrix} 0 & 0.94 & 0.4 & 0.39 \\ 0.94 & 0 & 0.54 & 0.55 \\ 0.4 & 0.54 & 0 & 0 \\ 0.39 & 0.55 & 0 & 0 \end{pmatrix} \]
Just one-way interaction?
Let \({\bf X}=\left[{\bf X}_{con}\;{\bf X}_{cat} \right]\) be a mixed data matrix with \(n\) observations described by \(Q_{d}\) continuous and \(Q_{c}\) categorical variables, and let \({\bf x}_{i}=\left[{\bf x}_{i_{con}}\;{\bf x}_{i_{cat}}\right]\) be the \(i^{th}\) observation
We build upon the following results
result A
The distribution of \({\bf x}_{i}\) can be written as
\[ f({\bf x}_{i_{con}},{\bf x}_{i_{cat}})=f({\bf x}_{i_{con}})f({\bf x}_{i_{cat}}\mid{\bf x}_{i_{con}}) \]
result B
According to Hennig et al. (2019)1, starting from an arbitrary distance \(d(x, c_{k})\) from a prototype, it is possible to construct a probabilistic clustering model2 as \(f(x;c_{k},s_{k})=g(c_{k},s_{k})\exp(-s_{k}d(x,c_{k}))\), where \(s_{k}\) is a concentration parameter.
Then, if \(f(\cdot)\sim N({\bf c}_{k}, {\bf S}_{k})\) and \(d(\cdot,\cdot)\) is the Mahalanobis distance, it follows that
\[ f({\bf x}_{i};{\bf c}_k, {\bf S}_k)=g({\bf c}_k, {\bf S}_k)\exp\left[-d({\bf x}_{i},{\bf c}_k)\right] \] where \(g({\bf c}_k, {\bf S}_k)\) is a normalization constant ensuring that \(f({\bf x}_{i};{\bf c}_k, {\bf S}_k)\) is a density function.
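A quick numerical check of this factorization in the univariate Gaussian case, taking \(d(x,c_k)\) as half the squared standardized difference (consistent with the expression derived below):

```r
## Univariate check: the Gaussian density factors as g(c, s) * exp(-d(x, c)),
## with d(x, c) = (x - c)^2 / (2 s^2)
x <- 1.3; c_k <- 0.5; s_k <- 2
g <- 1 / (sqrt(2 * pi) * s_k)                            # normalization constant = density maximum M_k
d <- (x - c_k)^2 / (2 * s_k^2)
all.equal(dnorm(x, mean = c_k, sd = s_k), g * exp(-d))   # TRUE
```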
Just one-way interaction?
result C
the dissimilarity between \({\bf x}_{i_{con}}\) and a generic cluster \(k\) with center \({\bf c}_k\) and covariance matrix \({\bf S}_k\) is1
\[ d({\bf x}_{i_{con}},{\bf c}_k)=\log(M_{k} f({\bf x}_{i_{con}};{\bf c}_k, {\bf S}_k)^{-1}) \]
\(f(\cdot)\) a symmetric probability density function
\(M_k\) is the maximum of the density function
that is, if \({\bf x}_{i_{con}}={\bf c}_k\) then \(d({\bf x}_{i_{con}},{\bf c}_k)=0\).
replacing \({\bf c}_{k}\) with a generic observation \({\bf x}_{i'_{con}}\), the above becomes
\[ d({\bf x}_{i_{con}},{\bf x}_{i'_{con}})=\log(M f({\bf x}_{i_{con}};{\bf x}_{i'_{con}}, {\bf S})^{-1}) \]
where \(M\) is the maximum of the density function.
Just one-way interaction…
Using result C, it can be shown that
\[ d({\bf x}_{i_{con}},{\bf x}_{i'_{con}})=\frac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}) {\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T} \]
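This follows because the Gaussian density centred at \({\bf x}_{i'_{con}}\) attains its maximum \(M=(2\pi)^{-Q_{d}/2}|{\bf S}|^{-1/2}\) at \({\bf x}_{i_{con}}={\bf x}_{i'_{con}}\), so the log-ratio \(\log\big(M\,f({\bf x}_{i_{con}};{\bf x}_{i'_{con}},{\bf S})^{-1}\big)\) reduces to the quadratic form in the exponent.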
categorical analogue
consider the \(Q_{c}\)-dimensional categorical vector \({\bf x}_{i_{cat}}\); for its \(j^{th}\) element we have
\[ p(x_{ij_{cat}}=a| x_{i'j_{cat}}=b) = \left[\delta^{j}(a,b)\right]^{-1} \]
where \(a\) and \(b\) are two general categories of the \(j^{th}\) variable
For the whole vector, \[ p({\bf x}_{i_{cat}}; {\bf x}_{i'_{cat}}) = \prod_{j=1}^{Q_c} p(x_{ij_{cat}}=a_{j}; x_{i'j_{cat}}=b_{j})=\prod_{j=1}^{Q_c}\left[\delta^{j}(a_{j},b_{j})\right]^{-1} \]
Just one-way interaction… (wrap-up)
using the previous results, it is possible to define the dissimilarity between two mixed-data observations \({\bf x}_i\) and \({\bf x}_{i'}\)
\[ d({\bf x}_{i}, {\bf x}_{i'})=\frac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}) {\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T}-\log(p({\bf x}_{i_{cat}}|{\bf x}_{i_{con}}; {\bf x}_{i'_{cat}}, {\bf x}_{i'_{con}} )) \]
that takes into account correlations, associations and cross-type interactions, and is equivalent to the general entry of the previously defined
\[ {\bf D}_{mix}^{(int)} = {\bf D}_{mah} + {\bf D}_{cat}^{(int)}. \]
Spectral clustering: a graph partitioning problem
Graph representation
a graph representation of the data matrix \(\bf X\): the aim is then to cut it into \(K\) groups (clusters)
the affinity matrix \({\bf A}\)
the elements \(w_{ij}\) of \(\bf A\) are high (low) if \(i\) and \(j\) are in the same (different) groups
|   | a | b | c | d |
|---|---|---|---|---|
| a | 0 | 0 | \(w_{ac}\) | 0 |
| b | 0 | 0 | \(w_{bc}\) | \(w_{bd}\) |
| c | \(w_{ca}\) | \(w_{cb}\) | 0 | \(w_{cd}\) |
| d | 0 | \(w_{db}\) | \(w_{dc}\) | 0 |
Spectral clustering: making the graph easy to cut
An approximated solution to the graph partitioning problem:
\[ {\bf L} = {\bf D}_{r}^{-1/2}\,\underbrace{\exp\left(-{\bf D}^{2}(2\sigma^{2})^{-1}\right)}_{\text{affinity matrix } {\bf A}}\,{\bf D}_{r}^{-1/2} = {\bf Q}{\Lambda}{\bf Q}^{\sf T} \]
\({\bf D}\) is the \(n\times n\) matrix of pairwise distances
the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)
diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\) , \(i=1,\ldots,n\)
\({\bf D}_{r}=\operatorname{diag}({\bf r})\), where \({\bf r}={\bf A}{\bf 1}\) and \({\bf 1}\) is an \(n\)-dimensional vector of 1’s
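A minimal R sketch of this pipeline starting from a distance matrix (the data, \(K\), and the \(\sigma\) rule of thumb are illustrative placeholders):

```r
## Spectral clustering from a pairwise distance matrix (illustrative sketch)
set.seed(1)
X <- rbind(matrix(rnorm(60 * 2), 60, 2), matrix(rnorm(60 * 2, mean = 4), 60, 2))
D <- as.matrix(dist(X))                                # n x n pairwise distances
n <- nrow(D); K <- 2

sigma <- median(apply(D, 1, function(d) sort(d)[21]))  # median distance to the 20th NN
A <- exp(-D^2 / (2 * sigma^2)); diag(A) <- 0           # affinity matrix, a_ii = 0
Dr_inv_sqrt <- diag(1 / sqrt(rowSums(A)))              # D_r^{-1/2}, with r = A 1
L <- Dr_inv_sqrt %*% A %*% Dr_inv_sqrt                 # normalized Laplacian

Q <- eigen(L, symmetric = TRUE)$vectors[, 1:K]         # leading K eigenvectors
Q <- Q / sqrt(rowSums(Q^2))                            # row-normalized spectral embedding
clusters <- kmeans(Q, centers = K, nstart = 20)$cluster
```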
Spectral clustering: solution and performance
SC works well, particularly in the case of non-convex and overlapping clusters1
toy experiment: data generation
is \(\delta_{int}^{j}(a,b)\) of help?
three continuous signal variables, three continuous Gaussian noise variables, and three categorical variables (one of which is noise)
different dependence structures and location shifts in the signal continuous variables across the groups defined by the signal categorical variables.
the continuous signal variables are generated conditionally on the signal categorical variables
\[ X_j = \beta_{0,hq} + \sum_{k>j}^{6} \beta_{1,hq} X_k, \qquad j = 1,2,3 \] where \(\beta_{1,hq}\) takes different values depending on the observed categories of the two signal categorical variables, \(h\) and \(q\), respectively.
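A hedged R sketch of the design's structure (group coefficients, sample size, and the error term are made-up placeholders, not the values used in the experiment):

```r
## Toy design: 3 continuous signal, 3 continuous noise, 3 categorical (1 noise) variables
## (illustrative placeholders, not the experiment's actual parameter values)
set.seed(1)
n <- 500
h <- factor(sample(1:3, n, TRUE))              # signal categorical variable 1
g <- factor(sample(1:2, n, TRUE))              # signal categorical variable 2
cat_noise <- factor(sample(1:3, n, TRUE))      # categorical noise variable

X <- matrix(0, n, 6)
X[, 4:6] <- matrix(rnorm(n * 3), n, 3)         # continuous Gaussian noise variables

b0  <- matrix(seq(-2, 2, length.out = 6), 3, 2)  # beta_{0,hq}: location shift per (h, g) group
b1  <- matrix(runif(6, 0.2, 1), 3, 2)            # beta_{1,hq}: dependence strength per group
grp <- cbind(as.integer(h), as.integer(g))

for (j in 3:1)                                 # signal continuous variables, conditional on (h, g)
  X[, j] <- b0[grp] + b1[grp] * rowSums(X[, (j + 1):6, drop = FALSE]) + rnorm(n, sd = 0.5)

toy <- data.frame(X, h = h, g = g, cat_noise = cat_noise)
```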
toy experiment
toy experiment: compared methods
Gower dissimilarity: a straightforward option
naive 1
modified Gower2
association-based approach3: with and without interaction
toy experiment
Final considerations and future work
association-based measures aim to go beyond match/mismatch of categories
when the signal is limited to few variables, retrieving information from cont/cat interaction may be useful
measuring interactions via a non-parametric, NN-based approach is suitable for non-convex/oddly shaped clusters
computationally demanding (but it can be made bearable)
\(\pi_{nn}\) tuning, regularization of \(\delta_{int}\)’s to reduce variability
The manydist package is designed around a modular distance-based workflow.
| Component | Function / Interface | Role |
|---|---|---|
| Core distance constructor | `mdist()` | Computes a pairwise dissimilarity matrix for mixed-type data; different independence-based and association-based measures can be specified, fully customizable |
| Recipe step | `step_mdist()` | Preprocessing step that computes an mdist dissimilarity matrix within a recipe, for seamless integration in learning pipelines |
| k-nearest neighbors | `mdist_knn()` | Model specification for k-NN using a precomputed mdist dissimilarity matrix as input; supports tune |
| Leave-one-variable-out | `lovo_mdist()` | LOVO dissimilarity computations to assess by-variable contributions |
| Component | Function / Interface | Role |
|---|---|---|
| Partitioning around medoids | `mdist_pam()` | Model specification for PAM clustering based on mdist dissimilarities |
| Spectral clustering | `mdist_spectral()` | Model specification for spectral clustering based on mdist dissimilarities |

…and more to come.
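A minimal usage sketch (the call below assumes `mdist()`'s default settings; see the package documentation for the full set of arguments and for the modelling interfaces listed above):

```r
## Basic manydist usage (illustrative; default settings assumed)
# install.packages("manydist")
library(manydist)

dat <- data.frame(
  x1 = rnorm(100), x2 = rnorm(100),               # continuous variables
  g1 = factor(sample(letters[1:3], 100, TRUE)),   # categorical variables
  g2 = factor(sample(c("low", "high"), 100, TRUE))
)

D <- mdist(dat)   # pairwise dissimilarities for the mixed-type data (matrix-like object assumed)
```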
main references
Hennig, C., C. Viroli, and L. Anderlucci (2019). “Quantile-based clustering”. In: Electronic Journal of Statistics 13.2, pp. 4849 - 4883. DOI: 10.1214/19-EJS1640. URL: https://doi.org/10.1214/19-EJS1640.
Le, S. Q. and T. B. Ho (2005a). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.
Mbuga, F. and C. Tortora (2021a). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.
Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.
Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2024). “A general framework for implementing distances for categorical variables”. In: Pattern Recognition 153, p. 110547.
Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2025a). “A general framework for unbiased mixed-variables distances”. In: Journal of Computational and Graphical Statistics (second-round review).
The publication was produced with funding from the Italian Ministry of University and Research under the Call for Proposals related to the scrolling of the final rankings of the PRIN 2022 call.
Latent variable models and dimensionality reduction methods for complex data (PI Prof. Paolo Giordani)
Project No. 20224CRB9E, CUP C53C24000730006
Grant Assignment Decree No. 1401 adopted on 18.9.2024 by MUR